Credit card Dataset for clustering

EDA

Drop rows with null values and drop CUST_ID column
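A minimal sketch of this cleanup step with pandas; the small DataFrame below is a hypothetical stand-in for the real credit card dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the credit card dataset
df = pd.DataFrame({
    "CUST_ID": ["C10001", "C10002", "C10003"],
    "BALANCE": [40.90, np.nan, 873.21],
    "PURCHASES": [95.40, 0.00, np.nan],
})

df = df.dropna()                    # drop rows containing any null values
df = df.drop(columns=["CUST_ID"])   # the ID column carries no clustering signal
print(df.shape)
```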

Draw a box plot to highlight the distribution of outliers in each column.
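One way to draw the box plots, sketched with the pandas plotting API on hypothetical skewed data (the real notebook would call this on the cleaned DataFrame):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed features standing in for the real columns
df = pd.DataFrame({
    "BALANCE": rng.exponential(1000, size=300),
    "PURCHASES": rng.exponential(500, size=300),
})

# One box per column; outliers appear as points beyond the whiskers
ax = df.plot(kind="box", figsize=(8, 4))
plt.tight_layout()
```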

Now we'll see how each column in the dataframe is distributed.

To deal with the skewness of data in columns, we use Log Transformation.
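A sketch of the log transformation using `np.log1p` (log(1 + x), which safely handles the zeros common in spending columns); the skewness reduction is checked here on hypothetical exponential data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed column, like BALANCE or PURCHASES
df = pd.DataFrame({"BALANCE": rng.exponential(1000, size=1000)})

skew_before = df["BALANCE"].skew()
df_log = np.log1p(df)               # log(1 + x) avoids log(0) for zero values
skew_after = df_log["BALANCE"].skew()
print(round(skew_before, 2), round(skew_after, 2))
```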

Display the difference in distribution after log transformation

Then we look at the dataset's features to see the correlation between them.
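Computing the correlation matrix is one line in pandas; the features below are hypothetical, with `ONEOFF_PURCHASES` deliberately constructed to correlate with `PURCHASES`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
purchases = rng.exponential(500, size=500)
# Hypothetical features; ONEOFF_PURCHASES is built to track PURCHASES
df = pd.DataFrame({
    "PURCHASES": purchases,
    "ONEOFF_PURCHASES": purchases * 0.6 + rng.normal(0, 50, size=500),
    "CASH_ADVANCE": rng.exponential(800, size=500),
})

corr = df.corr()
print(corr.round(2))
```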

To get the most significant components, we use PCA as a dimensionality reduction method.
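A sketch of the PCA step with scikit-learn, on a hypothetical scaled feature matrix. Passing a float to `n_components` keeps the smallest number of components that explain at least that fraction of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical scaled feature matrix (rows = customers, columns = features)
X = rng.normal(size=(200, 10))

# Keep enough components to explain at least 90% of the variance
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X)
print(X_pca.shape, round(pca.explained_variance_ratio_.sum(), 3))
```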

Now, we have two approaches to cluster the data:

1- Perform embedding using the TSNE algorithm, then cluster the TSNE output with each of the clustering algorithms.

2- Perform clustering on the data in the original high-dimensional space (which is not suitable for some algorithms), then embed the data with TSNE and visualize the clusters using the labels obtained before the embedding.

Now we will try each of the two approaches and see which one gives better clusters.

Use the TSNE algorithm for embedding (moving from the high-dimensional space to a low-dimensional one)
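A sketch of the embedding step with scikit-learn's `TSNE`, run here on a hypothetical matrix standing in for the PCA output:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
# Hypothetical PCA output: 300 samples in 6 dimensions
X_pca = rng.normal(size=(300, 6))

# Embed into 2-D for visualization and (in the first approach) clustering
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X_pca)
print(X_embedded.shape)
```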

First Approach: Cluster the TSNE output using each of the clustering algorithms

KMeansClustering Algorithm

Based on the elbow method, we choose 5 clusters.

We will try different numbers of clusters and pick the best one, using silhouette_score as the metric of clustering quality.
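The sweep over cluster counts can be sketched as follows; `make_blobs` generates a hypothetical 2-D embedding with a few well-separated groups, standing in for the TSNE output:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical 2-D embedding with 4 well-separated groups
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, in [-1, 1]

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```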

AgglomerativeClustering

Choose the number of clusters that gives the best silhouette_score.
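The same sweep works for `AgglomerativeClustering`; again the data is a hypothetical stand-in for the TSNE output:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical 2-D embedding standing in for the TSNE output
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```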

DBSCAN

Here we use DBSCAN to detect anomalies (noise points) in the TSNE output.
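DBSCAN labels points that belong to no dense region as -1, which is what makes it usable for anomaly detection. A sketch on hypothetical data with one obvious injected outlier:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical embedding: two dense groups plus one far-away outlier
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=4)
X = np.vstack([X, [[100.0, 100.0]]])  # inject an obvious anomaly

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_anomalies = int(np.sum(labels == -1))  # DBSCAN marks noise points as -1
print(n_anomalies)
```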

Only one point is detected as an anomaly.

Expectation-Maximization (EM) Algorithm
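In scikit-learn, EM clustering is provided by `GaussianMixture`, which is fit with the EM algorithm; a sketch on a hypothetical 2-D embedding:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D embedding with 3 groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=5)

# GaussianMixture parameters are estimated with Expectation-Maximization
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)
print(np.unique(labels))
```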

Isolation Forest
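A sketch of anomaly detection with scikit-learn's `IsolationForest`, on a hypothetical dense cloud with two injected outliers; `contamination` sets the expected fraction of anomalies:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(6)
# Hypothetical embedding: dense cloud plus two injected outliers
X = rng.normal(0, 1, size=(300, 2))
X = np.vstack([X, [[8.0, 8.0], [-9.0, 7.0]]])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)               # -1 = anomaly, 1 = normal
n_anomalies = int(np.sum(pred == -1))
print(n_anomalies)
```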

Second Approach: Cluster the PCA output first, then use TSNE with the labels obtained from this clustering

KMeansClustering Algorithm

Here we can see that KMeans clusters better with the first approach.

AgglomerativeClustering

There is no big difference between the two approaches with this algorithm.

DBSCAN

With this approach, the number of detected anomalies increases remarkably.

Expectation-Maximization (EM) Algorithm

From the clustering above, I can see that the first approach is better.

Isolation Forest

I can't decide which approach is better for anomaly detection; it may depend on the application in which the algorithm is used.

I found this link while searching this area: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

Try Different preprocessing (Use Kernel PCA instead of PCA)

This will be in a separate notebook named "Credit Card Clustering + Kernel PCA.ipynb".

Try Another preprocessing

We will use only one algorithm "KMeans" for clustering in the following part.

1- Remove nulls

2- Drop CUST_ID column

3- Robust Scaler, which is suitable for datasets with skewed distributions and outliers because it transforms the data based on the median and quantiles
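The robust scaling step can be sketched as follows; the column below is hypothetical, with one huge outlier injected to show that `RobustScaler` (centering on the median, scaling by the IQR) leaves the bulk of the data unaffected by it:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(7)
# Hypothetical skewed feature with a single large outlier appended
df = pd.DataFrame({"BALANCE": np.append(rng.exponential(1000, 99), 1e6)})

# RobustScaler centers on the median and scales by the interquartile range,
# so the one huge outlier barely shifts the bulk of the data
scaled = RobustScaler().fit_transform(df)
print(round(float(np.median(scaled)), 6), round(float(scaled.max()), 1))
```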

KMeansClustering Algorithm

With only robust scaling, the data is not well clustered, so let's try Robust Scaler + PCA.

Robust Scaling + PCA

This preprocessing failed to cluster the data well.

Robust Scaler + Log Transformation + PCA

Try Another Preprocessing

Standard scaler + PCA

TSNE projection of df3_PCA

Here too, the data is not well clustered.

Log Transformation + Standard Scaler + PCA

This preprocessing combination with KMeans also performs badly.

I find that the best pipeline in this notebook is: remove rows with nulls, then log transformation, then PCA.